#libraries needed
library(ggplot2)
library(gridExtra)
library(lme4)
1. Fitting a linear model
The file sales1.csv consists of quarterly sales volumes (in % and indexed to the time 0) of a product.
a. Plotting the data
data_path <- "/Users/mariafcadena/Documents/Master Degree/Polytechnique/Statistics in Action/Case Study 2/salesData/sales1.csv"
sales1 <- read.csv(file=data_path)
pl1 <- ggplot(sales1) +
geom_point(aes(x=time, y=y),size=3, colour="Royalblue") +
ylab("% Sales Volumes") +
xlab("time")
pl1 + ggtitle("Quarterly sales volumes")

From the plot we can see that the quarterly sales seem to have and increasing growth over the years. Since is not completely obvious the type of relation there is between the time and the sales volume we need to contruct a polynomial model that will allow us to understand this relationship.
b. Fitting a polynomial model
We are not sure from the beggining on the degree of the polynomial model we need to fit our data. Therefore, we are going to build several different models to which one of the fits better the data and gives as the best approximation for the sales volume.
Polynomial of degree 0
For this model we assume that the quarterly sales volumes don’t depend on time. Therefore, we only use the sales data as an input for our model. \[y_j=\beta_0\quad; \quad1≤j≤n\]
lm0 <- lm(y ~ 1, data = sales1)
summary(lm0)
Call:
lm(formula = y ~ 1, data = sales1)
Residuals:
Min 1Q Median 3Q Max
-12.292 -5.304 -1.355 5.055 13.299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.772 1.612 69.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.389 on 20 degrees of freedom
\[y_j=111.78; \quad1≤j≤n\]
We only get one coefficient (\(\beta_0\)) that basically corresponds to the average of the quartely sales volume.
pl1 + geom_hline(yintercept=coef(lm0)[1],size=1, colour="Coral") + ggtitle("Polynomial Model (Degree 0)")

To understand the performance of the model we are going to check the residuals against fitted values and a normal QQ plot.
par(mfrow = c(1, 2))
plot(lm0, which=c(1,2))

We can confirm from the prediction plot and the residuals plot how this model does not fit our data. Since we are predicting for each quarter the average value, we can see from how scatter the residual and how they are not all close to zero. Hence the model does not work to fit this data.
Polynomial of degree 1
Now we are going to assume now a linear trend, defined by: \[y_j=\beta_0+\beta_1x_j+e_j \quad; \quad1≤j≤n\]
Where \(x_j\) and \(y_j\) references, respectively, the n measures of time and sales volumes.
lm1 <- lm(y ~ time, data=sales1)
summary(lm1)
Call:
lm(formula = y ~ time, data = sales1)
Residuals:
Min 1Q Median 3Q Max
-3.6994 -1.5777 -0.3596 1.6444 3.4747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.99688 0.94971 105.29 < 2e-16 ***
time 0.37985 0.02643 14.37 1.17e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.2 on 19 degrees of freedom
Multiple R-squared: 0.9158, Adjusted R-squared: 0.9113
F-statistic: 206.5 on 1 and 19 DF, p-value: 1.166e-11
coef(lm1)
(Intercept) time
99.9968826 0.3798515
\[y_j=99.99+0.38x_j+e_j \quad; \quad1≤j≤n\]
pl1 + geom_abline(intercept=coef(lm1)[1],slope=coef(lm1)[2],size=1, colour="lightseagreen") +
ggtitle("Polynomial Model (Degree 1")

To understand the performance of the model we are going to check the residuals against fitted values and a normal QQ plot.
par(mfrow = c(1, 2))
plot(lm1, which=c(1,2))

The residual plot suggests that the residuals are not identically distributed around 0. From the QQ plot we can infer that it may be necessary to improve the regression model since the extreme residual values are not the extreme values of a normal distribution.
Polynomial of degree 2
Given the previous results we can try describe the extreme values using a polynomial of higher degree. So we are going to try now to fit a polynomial of degree 2. \[y_j=\beta_0+\beta_1x_j+\beta_2{x_j}^2+e_j \quad; \quad1≤j≤n\]
lm2 <- lm(y ~ time + I(time^2), data=sales1)
summary(lm2)
Call:
lm(formula = y ~ time + I(time^2), data = sales1)
Residuals:
Min 1Q Median 3Q Max
-3.9334 -1.4679 -0.1228 1.5395 3.4804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.006e+02 1.426e+00 70.521 < 2e-16 ***
time 3.209e-01 1.065e-01 3.013 0.00747 **
I(time^2) 9.511e-04 1.662e-03 0.572 0.57422
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.24 on 18 degrees of freedom
Multiple R-squared: 0.9173, Adjusted R-squared: 0.9081
F-statistic: 99.77 on 2 and 18 DF, p-value: 1.818e-10
coef(lm2)
(Intercept) time I(time^2)
1.005970e+02 3.208831e-01 9.511046e-04
\[y_j=100.59+0.32x_j+0.0095{x_j}^2+e_j \quad; \quad1≤j≤n\]
f <- function(x,c) coef(lm2)[1] +
coef(lm2)[2]*x +
coef(lm2)[3]*(x^2)
pl1 + stat_function(fun=f, colour="gold", size=1)+
ggtitle("Polynomial Model (Degree 2)")

To understand the performance of the model we are going to check the residuals against fitted values and a normal QQ plot.
par(mfrow = c(1, 2))
plot(lm2, which=c(1,2))

We can see an improvement in the residuals, since they are now closely distributed around zero
To have a better understanding of how the models perform and choose the proper one, we are going to perform the Akaike information criterion and Bayesian information criterion.
AIC(lm0,lm1,lm2)
BIC(lm0,lm1, lm2)
We are going to choose the model with lowest AIC and BIC. From the results we can see how both criteria agree for rejecting lm0 with high confidence, and both have a slight preference for lm1 over lm2. Since the interpretation for a linear model is simpler than for a second degree polynomial, we are going to choose lm1 as the model that fits our data.
c. Adding a periodic component
We ara going to try to improve our model by adding a periodic component. Since we are not sure what periodic component is the one that will help us have the best improvement to the model, we are going to try three different ones and then test which one fits better the data.
Periodic component \(cos(\frac{2\pi time}{12})\)
lm_p1 <- lm(data = sales1, y ~ time + I(cos(2*pi*time/12)))
f <- function(x,c) coef(lm_p1)[1] +
coef(lm_p1)[2]*x +
coef(lm_p1)[3]*cos(2*pi*x/12)
pl1 + stat_function(fun=f, colour="gold", size=1)+
ggtitle("Polynomial Model with periodic componet cos()")

Periodic component \(sin(\frac{2\pi time}{12})\)
lm_p2 <- lm(data = sales1, y ~ time + I(sin(2*pi*time/12)))
f <- function(x,c) coef(lm_p2)[1] +
coef(lm_p2)[2]*x +
coef(lm_p2)[3]*cos(2*pi*x/12)
pl1 + stat_function(fun=f, colour="lightblue", size=1)+
ggtitle("Polynomial Model with periodic componet sin()")

Periodic component \(cos(\frac{2\pi time}{12}) + sin(\frac{2\pi time}{12})\)
lm_p3 <- lm(data = sales1, y ~ time + I(cos(2*pi*time/12)+sin(2*pi*time/12)))
f <- function(x,c) coef(lm_p3)[1] +
coef(lm_p3)[2]*x +
coef(lm_p3)[3]*cos(2*pi*x/12)
pl1 + stat_function(fun=f, colour="lightgreen", size=1)+
ggtitle("Polynomial Model with periodic componet cos() + sin()")

We are going to perform now the AIC and BIC test in order to check which model fits better our data.
AIC(lm_p1,lm_p2,lm_p3)
BIC(lm_p1,lm_p2,lm_p3)
We can see how is suggested by AIC and BIC that lm_3 is the model that best describes the behaviour of the quarterly sales volume.
coef(lm_p3)
(Intercept) time
99.7966320 0.3826308
I(cos(2 * pi * time/12) + sin(2 * pi * time/12))
1.7539826
Therefore, the model that describes our data is define as following: \[y_j=99.69+0.38x_j+1.75*\bigg[cos\bigg(\frac{2 \pi x_j}{12}\bigg)+sin\bigg(\frac{2 \pi x_j}{12}\bigg)\bigg]+e_j \quad; \quad1≤j≤n\]
d. Validating the results
f <- function(x,c) coef(lm_p3)[1] +
coef(lm_p3)[2]*x +
coef(lm_p3)[3]*(cos(2*pi*x/12)+sin(2*pi*x/12))
pl1 + stat_function(fun=f, colour="coral", size=1) + ggtitle("Quarterly Volume Sales Prediction vs Real values")

par(mfrow = c(1, 1))
plot(lm_p3, which=c(1,1))

We can see how the residuals we shift the distribution of the residuals from the obe we obtained for the regression model without the periodic component. We still get a fairly distribution of the residuals around zero. We can agreee that this model fits better our data that the one we defined before.
e. Adding a constraint on the intercept
We want the predicted sales volume to be equal to 100 at time 0. Therefore we redefined our model as:
lm_periodic2 <- lm(data = sales1, I(y-100) ~ 0 + time+ I(cos(2*pi*time/12)+sin(2*pi*time/12)))
summary(lm_periodic2)
Call:
lm(formula = I(y - 100) ~ 0 + time + I(cos(2 * pi * time/12) +
sin(2 * pi * time/12)), data = sales1)
Residuals:
Min 1Q Median 3Q Max
-1.94138 -0.89474 -0.03139 0.83742 2.17444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
time 0.37775 0.00701 53.891 < 2e-16 ***
I(cos(2 * pi * time/12) + sin(2 * pi * time/12)) 1.74827 0.24682 7.083 9.73e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.153 on 19 degrees of freedom
Multiple R-squared: 0.9937, Adjusted R-squared: 0.993
F-statistic: 1495 on 2 and 19 DF, p-value: < 2.2e-16
f <- function(x,c) 100 + coef(lm_periodic2)[1] * x +
coef(lm_periodic2)[2] * (cos(2*pi*x/12)+sin(2*pi*x/12))
pl1 + stat_function(fun=f, colour="lightcoral", size=1)

par(mfrow = c(1, 1))
plot(lm_periodic2, which=c(1,1))

Adding the constraint over the intercept has a sligth impact on the fitting of the model. Which makes sense since we are not trying the model to fit the intercept to the data available, instead we are forcing it to a defualt value of 100.
2. Fitting a linear mixed effects model Now we hve the data of quarterly sales volumes of 30 different products.
a. Plotting the data
sales30 <- read.csv("/Users/mariafcadena/Documents/Master Degree/Polytechnique/Statistics in Action/Case Study 2/salesData/sales30.csv")
pl2 <-ggplot(sales30) + geom_point(aes(x=time, y=y),size=1, colour="Royalblue") + ggtitle("Quarterly sales volumes")
print(pl2)

pl3 <- ggplot(sales30) +
geom_point(aes(x=time, y=y),size=1, colour="Royalblue") +
ggtitle("Quarterly sales volumes") +
facet_wrap(~id, ncol=6)
print(pl3)

b. Fitting a model to the first product
We are going to fit the model used previously for fitting the first series to this data.
lm_periodic3 <- lm(data = subset(sales30, id==1), y ~ time + I(cos(2*pi*time/12)+sin(2*pi*time/12)))
summary(lm_periodic3)
Call:
lm(formula = y ~ time + I(cos(2 * pi * time/12) + sin(2 * pi *
time/12)), data = subset(sales30, id == 1))
Residuals:
Min 1Q Median 3Q Max
-1.86773 -0.74963 0.07146 0.90215 2.12213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.79663 0.50996 195.697 < 2e-16 ***
time 0.38263 0.01418 26.992 5.16e-16 ***
I(cos(2 * pi * time/12) + sin(2 * pi * time/12)) 1.75398 0.25288 6.936 1.76e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.18 on 18 degrees of freedom
Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.4 on 2 and 18 DF, p-value: 1.758e-15
f <- function(x,c) coef(lm_periodic3)[1] +
coef(lm_periodic3)[2]*x +
coef(lm_periodic3)[3]*(cos(2*pi*x/12)+sin(2*pi*x/12))
pl3 <-ggplot(subset(sales30,id==1)) + geom_point(aes(x=time, y=y),size=2, colour="Royalblue") + ggtitle("Quarterly sales volumes first series")
pl3 + stat_function(fun=f, colour="lightcoral", size=1)

par(mfrow = c(1, 1))
plot(lm_periodic3, which=c(1,1))

We can see how we get more or less the same results as the ones we obtained with the previous data. So we can agree that this models suits us for predicting the sales volume of one product on the data set.
c. Fitting a mixed effect model
We have been able to create a model that describes pretty accurately the prediction of the sales volumes of one product. Now if we want to construct a model for discribing the overall behaviour of the sales volume for the different products its important to take into account the effect that each product may have over the prediciont. From the start is not clear to know over which component of the model the type of product will have and impact. Therefore we are going to define several different models assuming all the scenarios where factors could depend on the product. Then we are going to test and choose the model that best fits our data.
Model 1: Nothing depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}\]
Model 2: The value of the intercept (sales volume at time 0) depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+e_{ij}\]
Model 3: The growth rate per quarter depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+e_{ij}\]
Model 4: The intercept and the growth rate per quarter depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+\eta_{i1}*x_{ij}+e_{ij}\]
Model 5: The periodic component depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}\]
Model 6: The intercept and the periodic component depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}\]
Model 7: The growth rate per quarter and periodic component depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}\]
Model 8: Evereything depends on the product: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}\]
m1 <- lm(y ~ time + I(cos(2*pi*time/12) + sin(2*pi*time/12)), data=sales30)
m2 <- lmer(y ~ time + I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (1|id), data=sales30)
m3 <- lmer(y ~ time + I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1 + time|id) , data=sales30)
m4 <- lmer(y ~ 1 + time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (time|id) , data=sales30)
m5 <- lmer(y ~ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30)
m6 <- lmer(y ~ 1 +time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30)
m7 <- lmer(y ~ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30)
m8 <- lmer(y ~ 1+ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30)
BIC(m1,m2,m3,m4, m5,m6,m7, m8)
The BIC suggest then that the model that best fit our data and explains the effect each product has over the general model is model 7. This models suggest that each products introduce a random effect in the growth rate and also on the periodic component but the intercept remains the same for all the products.
m7
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ time + I(cos(2 * pi * time/12) + sin(2 * pi * time/12)) +
(-1 + time + I(cos(2 * pi * time/12) + sin(2 * pi * time/12)) | id)
Data: sales30
REML criterion at convergence: 2788.325
Random effects:
Groups Name Std.Dev. Corr
id time 0.1376
I(cos(2 * pi * time/12) + sin(2 * pi * time/12)) 1.3184 -0.35
Residual 1.8468
Number of obs: 630, groups: id, 30
Fixed Effects:
(Intercept) time
100.1225 0.2155
I(cos(2 * pi * time/12) + sin(2 * pi * time/12))
1.0666
Therefore the mixed effect model that describes our data is defined as following: \[y_{ij}=100.12+0.21*x_{ij}+1.06*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}\]
d. Plotting the predicted values
We are going to plot the data with the predicted sales given by our final model.
prediction <- fitted(m7)
ggplot(data=sales30) + geom_point(aes(x=time,y=y), color="lightseagreen", size=1) +
geom_line(aes(x=time,y=prediction), color="Royalblue") + facet_wrap(~id, ncol=6)

qqnorm(residuals(m7))

As we can see the plots above, our model fits pretty well every one of the 30 products since the quantile plot does not raise any significant concern with normality of the residuals.
d. Adding a constraint over the intercept
We are going to set the predicted sales volume of all products equal to 100 at time 0.
constraint_model <- lmer(I(y-100) ~ 0 +time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id), data = sales30)
f<-fitted(constraint_model)
ggplot(data=sales30) + geom_point(aes(x=time,y=y), color="lightseagreen", size=1) +
geom_line(aes(x=time,y=f+100), color="Royalblue") + facet_wrap(~id, ncol=6)

qqnorm(residuals(constraint_model))

Once again we see how adding the constraint, still gets us a well fitted model and as we can see in the quantile plot it does not have a negative impact on the effectiveness of our model. 2. Individual prediction Now we are going to see how well does our model perform for predicting the sales volume for a new product.
data_path <- "/Users/mariafcadena/Documents/Master Degree/Polytechnique/Statistics in Action/Case Study 2/salesData/salesNew.csv"
salesNew <- read.csv(file=data_path)
a. Prediction without any data
Without any information on how are the sales for this new product we will assume that we can predict the sales with the fixed parts of the model we previously defined.
Given that our model is: \[y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}\] We will assume that \(\eta_{i0}\), \(\eta_{i1}\), and \(\eta_{i2}\) are zero since we dont have anydata to define the random effect this product will have over the population values.
Therefore our equation for prediction the values of this product will be: \[y_{ij}=100.12+0.21*x_{ij}+1.06*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]\] (Intercept) time
I(cos(2 * pi * time/12) + sin(2 * pi * time/12))
f <- function(x,c) 100.1225 +
0.2155*x +
1.0666*(cos(2*pi*x/12)+sin(2*pi*x/12))
pl4 <- ggplot(salesNew) +
geom_point(aes(x=time, y=y),size=3, colour="Royalblue") +
ylab("% Sales Volumes") +
xlab("time")
print(pl4 + ggtitle("Quarterly sales volumes new product") + stat_function(fun=f, colour="lightcoral", size=1))

As we can see, the fixed part of our model gives us an approximation of the sales volume for the product. But its clear that there is a need on having some data from the previous sales behaviour of product, to fit or train our model to predict more accurately the sales for this new product. Or in other words to understand the random effect that this product has on the “population” growth and periodical component.
b. Prediction with one data point for the new product
To adjust the model we are going to fit again the mixed effect model with the data from the new product. Since we only have data for one period (time=1), then we are going to assume the population values for the other quarters and train the model with this data.
New_Data<-sales30
New_Product<-as.data.frame(cbind(matrix(31, ncol = 1, nrow = nrow(salesNew)), salesNew$time))
f <- function(x,c) 100.1225 +
0.2155*x +
1.0666*(cos(2*pi*x/12)+sin(2*pi*x/12))
New_Product<-cbind(New_Product, lapply(New_Product[2], f))
colnames(New_Product) <- c("id", "time","y")
#We create a new dataframe with the new product, the values that we used for the sales volume (y) are the ones predicted by the fixed
New_Data<-rbind(New_Data, New_Product)
#We update the value of time 1 for the new product with the known value
time1<-salesNew$y[salesNew$time == 1 ]
New_Data$y[New_Data$time==1 & New_Data$id==31]<-time1
#Now we fit again the mixed effect model with this new data
New_model_d1<- lmer(y ~ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=New_Data)
To validate how the new model fits the new product, we are going to obtain the betas or coefficients from the mixed effect model for the new product (id=31) and then contruct the function to predict the sales volume. With this information we are going to plot our prediction vs the actual values to see how well the model performs.
betas<-cbind(coef(New_model_d1)$id, matrix(c(1:31),ncol=1, nrow=31))
colnames(betas) <- c("b_0", "b_1","b_2", "id")
f_d1 <- function(x,c) betas$b_0[betas$id==31] +
betas$b_1[betas$id==31]*x +
betas$b_2[betas$id==31]*(cos(2*pi*x/12)+sin(2*pi*x/12))
print(pl4 + ggtitle("Quarterly sales volumes new product") + stat_function(fun=f_d1, colour="lightcoral", size=1))

And compute the residual sum of squares to understand how good the model performs.
Final_Data <- as.data.frame((salesNew[2]-lapply(salesNew[1], f))^2)
PredictionError <- sum(Final_Data$y)
PredictionError
[1] 1250.952
We can see that still the model does not gather enough information to correctly defined the random effect that the new product has over the fixed model
c. Prediction with increasing data for the new product
Now we are going to perform the same computation that we did before, but now adding at a time one known value of the sales volume. To see how adding information to the model helps as improve the fitting of the model.
New_Data<-sales30
New_Product<-as.data.frame(cbind(matrix(31, ncol = 1, nrow = nrow(salesNew)), salesNew$time))
f <- function(x,c) 100.1225 +
0.2155*x +
1.0666*(cos(2*pi*x/12)+sin(2*pi*x/12))
New_Product<-cbind(New_Product, lapply(New_Product[2], f))
colnames(New_Product) <- c("id", "time","y")
#We create a new dataframe with the new product, the values that we used for the sales volume (y) are the ones predicted by the fixed
New_Data<-rbind(New_Data, New_Product)
iter=0
for (i in salesNew$time)
{
iter=iter+1
#We update the value of y with then known sales valume for the i quarter
sales_iquarter<-salesNew$y[salesNew$time == i ]
New_Data$y[New_Data$time==i & New_Data$id==31]<-sales_iquarter
#We fit the model to the new data
New_model_dt<- lmer(y ~ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=New_Data)
#we obtain the coefficients and re-construct the model for the new product (id=31)
betas<-cbind(coef(New_model_dt)$id, matrix(c(1:31),ncol=1, nrow=31))
colnames(betas) <- c("b_0", "b_1","b_2", "id")
f <- function(x,c) betas$b_0[betas$id==31] +
betas$b_1[betas$id==31]*x +
betas$b_2[betas$id==31]*(cos(2*pi*x/12)+sin(2*pi*x/12))
#We compute the new prediction for the sales volume and store in a dataframe the results of our iteration
if (i==1){
Results<-as.data.frame(cbind(matrix(toString(iter), ncol = 1, nrow = nrow(salesNew)),salesNew$time,salesNew$y, as.data.frame(lapply(salesNew[1], f))))
colnames(Results)<-c("iteration","time","y","prediction")
} else{
temp<-as.data.frame(cbind(matrix(toString(iter), ncol = 1, nrow = nrow(salesNew)),salesNew$time,salesNew$y, as.data.frame(lapply(salesNew[1], f))))
colnames(temp)<-c("iteration","time","y","prediction")
Results<-rbind(Results,temp)
}
}
Having the results for each iteration of adding known values to fit the model, we can now plot the predicted values per iteration against the actual values.
pl5 <- ggplot(Results) +
geom_point(aes(x=time, y=y),size=3, colour="Royalblue") +
geom_line(aes(x=time, y=prediction,col=iteration)) +
ylab("% Sales Volumes") +
xlab("time")
pl5 + ggtitle("Predicted Values per Iteration vs Actual Values")

We also comput the sum of residual errors to be able to compare the results of each model
Results["SR"]<-as.data.frame((Results$y-Results$prediction)^2)
SSR<-aggregate(Results$SR, by=list(iteration=Results$iteration), FUN=sum)
ggplot(SSR) + geom_point(aes(x=iteration, y=x),size=2, colour="red") + ggtitle("Convergence of SSR per Iteration")

We see how having more and more information of our new product help the model to have a better fit of the data and therefore have lower residual error
We can plot the final fit of the mixed effect model with all the data available:
f<-fitted(New_model)
ggplot(data=New_Data) + geom_point(aes(x=time,y=y), color="lightseagreen", size=1) +
geom_line(aes(x=time,y=f), color="Royalblue") + facet_wrap(~id, ncol=6)

We can see how the model manages to gather the effects of the new product and at the same time does not damages the predictions for the other products.
---
title: "CADENA Maria, Case Study 2"
output:
  html_document: default
  html_notebook: default
  pdf_document: default
---

```{r}
#libraries needed
library(ggplot2)
library(gridExtra)
library(lme4)
```
</b>
</b>
<font size="5" color="seagreen"><b>
*1. Fitting a linear model*
</b></font><br/>
</b>
The file sales1.csv consists of quarterly sales volumes (in % and indexed to the time 0) of a product.<br/>
<br/>
<font size="3" color="MediumSeaGreen">
*a. Plotting the data*
</font><br/>
```{r}
data_path <- "/Users/mariafcadena/Documents/Master Degree/Polytechnique/Statistics in Action/Case Study 2/salesData/sales1.csv"
sales1    <- read.csv(file=data_path)
pl1       <- ggplot(sales1) + 
             geom_point(aes(x=time, y=y),size=3, colour="Royalblue") +
             ylab("% Sales Volumes") +
             xlab("time")
pl1 + ggtitle("Quarterly sales volumes")
```
<p align="justify">
From the plot we can see that the quarterly sales seem to have and increasing growth over the years. Since is not completely obvious the type of relation there is between the time and the sales volume we need to contruct a polynomial model that will allow us to understand this relationship.<br/>
<br/>
<font size="3" color="MediumSeaGreen">
*b. Fitting a polynomial model*
</font><br/>
We are not sure from the beggining on the degree of the polynomial model we need to fit our data. Therefore, we are going to build several different models to which one of the fits better the data and gives as the best approximation for the sales volume.<br/>
<br/>
<b>*Polynomial of degree 0*</b>
<br/>
For this model we assume that the quarterly sales volumes don't depend on time. Therefore, we only use the sales data as an input for our model.
$$y_j=\beta_0\quad; \quad1≤j≤n$$
```{r}
lm0 <- lm(y ~ 1, data = sales1)
summary(lm0)
```
$$y_j=111.78; \quad1≤j≤n$$
<br/>We only get one coefficient ($\beta_0$) that basically corresponds to the average of the quartely sales volume.<br/>
```{r}
pl1  + geom_hline(yintercept=coef(lm0)[1],size=1, colour="Coral") + ggtitle("Polynomial Model (Degree 0)")
```
<br/> To understand the performance of the model we are going to check the residuals against fitted values and a normal QQ plot.
```{r}
par(mfrow = c(1, 2))
plot(lm0, which=c(1,2))
```
<br/>We can confirm from the prediction plot and the residuals plot how this model does not fit our data. Since we are predicting for each quarter the average value, we can see from how scatter the residual and how they are not all close to zero. Hence the model does not work to fit this data.<br/>
<br/>
<b>*Polynomial of degree 1*</b>
<br/>
Now we are going to assume now a linear trend, defined by:
$$y_j=\beta_0+\beta_1x_j+e_j \quad; \quad1≤j≤n$$
<br/>Where $x_j$ and $y_j$ references, respectively, the n measures of time and sales volumes.<br/>
```{r}
lm1 <- lm(y ~ time, data=sales1)
summary(lm1)
```
```{r}
coef(lm1)
```
$$y_j=99.99+0.38x_j+e_j \quad; \quad1≤j≤n$$
```{r}
pl1  + geom_abline(intercept=coef(lm1)[1],slope=coef(lm1)[2],size=1, colour="lightseagreen") +
       ggtitle("Polynomial Model (Degree 1")
```
<br/> To understand the performance of the model we are going to check the residuals against fitted values and a normal QQ plot.
```{r}
par(mfrow = c(1, 2))
plot(lm1, which=c(1,2))
```
<br/>The residual plot suggests that the residuals are not identically distributed around 0. From the QQ plot we can infer that it may be  necessary to improve the regression model since the extreme residual values are not the extreme values of a normal distribution.<br/>
<br/>
<b>*Polynomial of degree 2*</b>
<br/>
Given the previous results we can try describe the extreme values using a polynomial of higher degree. So we are going to try now to fit a polynomial of degree 2.
$$y_j=\beta_0+\beta_1x_j+\beta_2{x_j}^2+e_j \quad; \quad1≤j≤n$$
<br/>
```{r}
lm2 <- lm(y ~ time + I(time^2), data=sales1)
summary(lm2)
```
```{r}
coef(lm2)
```
$$y_j=100.59+0.32x_j+0.0095{x_j}^2+e_j \quad; \quad1≤j≤n$$
```{r}
f <- function(x,c) coef(lm2)[1] + 
                   coef(lm2)[2]*x +
                   coef(lm2)[3]*(x^2)

pl1  + stat_function(fun=f, colour="gold", size=1)+ 
       ggtitle("Polynomial Model (Degree 2)")
```
<br/> To understand the performance of the model we are going to check the residuals against fitted values and a normal QQ plot.
```{r}
par(mfrow = c(1, 2))
plot(lm2, which=c(1,2))
```
<br/> We can see an improvement in the residuals, since they are now closely distributed around zero<br/>
<br/> To have a better understanding of how the models perform and choose the proper one, we are going to perform the Akaike information criterion and Bayesian information criterion.
```{r}
AIC(lm0,lm1,lm2)
```

```{r}
BIC(lm0,lm1, lm2)
```
<br/>We are going to choose the model with lowest AIC and BIC. From the results we can see how both criteria agree for rejecting lm0 with high confidence, and both have a slight preference for lm1 over lm2. Since the interpretation for a linear model is simpler than for a second degree polynomial, we are going to choose lm1 as the model that fits our data.<br/>
<br/>
<font size="3" color="MediumSeaGreen">
*c. Adding a periodic component*
</font><br/>
We ara going to try to improve our model by adding a periodic component. Since we are not sure what periodic component is the one that will help us have the best improvement to the model, we are going to try three different ones and then test which one fits better the data.<br/>
<b>*Periodic component* $cos(\frac{2\pi time}{12})$</b>
```{r}
lm_p1 <- lm(data = sales1, y ~ time + I(cos(2*pi*time/12)))
f <- function(x,c) coef(lm_p1)[1] + 
                   coef(lm_p1)[2]*x +
                   coef(lm_p1)[3]*cos(2*pi*x/12)

pl1  + stat_function(fun=f, colour="gold", size=1)+ 
       ggtitle("Polynomial Model with periodic componet cos()")
```
<b>*Periodic component* $sin(\frac{2\pi time}{12})$</b>
```{r}
lm_p2 <- lm(data = sales1, y ~ time + I(sin(2*pi*time/12)))
f <- function(x,c) coef(lm_p2)[1] + 
                   coef(lm_p2)[2]*x +
                   coef(lm_p2)[3]*cos(2*pi*x/12)

pl1  + stat_function(fun=f, colour="lightblue", size=1)+ 
       ggtitle("Polynomial Model with periodic componet sin()")
```
<b>*Periodic component* $cos(\frac{2\pi time}{12}) + sin(\frac{2\pi time}{12})$</b>
```{r}
lm_p3 <- lm(data = sales1, y ~ time + I(cos(2*pi*time/12)+sin(2*pi*time/12)))
f <- function(x,c) coef(lm_p3)[1] + 
                   coef(lm_p3)[2]*x +
                   coef(lm_p3)[3]*cos(2*pi*x/12)

pl1  + stat_function(fun=f, colour="lightgreen", size=1)+ 
       ggtitle("Polynomial Model with periodic componet cos() + sin()")
```
<br/> We are going to perform now the AIC and BIC test in order to check which model fits better our data. <br/>
```{r}
AIC(lm_p1,lm_p2,lm_p3)
```

```{r}
BIC(lm_p1,lm_p2,lm_p3)
```
<br/> We can see how is suggested by AIC and BIC that lm_3 is the model that best describes the behaviour of the quarterly sales volume.
```{r}
coef(lm_p3)
```
<br/>
Therefore, the model that describes our data is define as following:
$$y_j=99.69+0.38x_j+1.75*\bigg[cos\bigg(\frac{2 \pi x_j}{12}\bigg)+sin\bigg(\frac{2 \pi x_j}{12}\bigg)\bigg]+e_j \quad; \quad1≤j≤n$$
<br/>
<font size="3" color="MediumSeaGreen">
*d. Validating the results*
</font><br/>
```{r}
f <- function(x,c) coef(lm_p3)[1] + 
                   coef(lm_p3)[2]*x + 
                   coef(lm_p3)[3]*(cos(2*pi*x/12)+sin(2*pi*x/12))
pl1  + stat_function(fun=f, colour="coral", size=1) + ggtitle("Quarterly Volume Sales Prediction vs Real values")
```
```{r}
par(mfrow = c(1, 1))
plot(lm_p3, which=c(1,1))
```
<br/> We can see how the residuals we shift the distribution of the residuals from the obe we obtained for the regression model without the periodic component. We still get a fairly distribution of the residuals around zero. We can agreee that this model fits better our data that the one we defined before.<br/>
<br/>
<font size="3" color="MediumSeaGreen">
*e. Adding a constraint on the intercept*
</font><br/>
We want the predicted sales volume to be equal to 100 at time 0. Therefore we redefined our model as:
```{r}
lm_periodic2 <- lm(data = sales1, I(y-100) ~ 0 + time+ I(cos(2*pi*time/12)+sin(2*pi*time/12)))
summary(lm_periodic2)
f <- function(x,c)  100 + coef(lm_periodic2)[1] * x + 
                          coef(lm_periodic2)[2] * (cos(2*pi*x/12)+sin(2*pi*x/12))
pl1  + stat_function(fun=f, colour="lightcoral", size=1)
```
```{r}
par(mfrow = c(1, 1))
plot(lm_periodic2, which=c(1,1))
```
<br/>Adding the constraint over the intercept has a sligth impact on the fitting of the model. Which makes sense since we are not trying the model to fit the intercept to the data available, instead we are forcing it to a defualt value of 100.<br/>
<br/>
<br/>
<font size="5" color="seagreen"><b>
*2. Fitting a linear mixed effects model*
</b></font>
Now we hve the data of quarterly sales volumes of 30 different products.<br/>
<br/>
<font size="3" color="MediumSeaGreen">
*a. Plotting the data*
</font><br/>
```{r}
sales30 <- read.csv("/Users/mariafcadena/Documents/Master Degree/Polytechnique/Statistics in Action/Case Study 2/salesData/sales30.csv")
pl2     <-ggplot(sales30) + geom_point(aes(x=time, y=y),size=1, colour="Royalblue") + ggtitle("Quarterly sales volumes")
print(pl2)
```

```{r}
pl3   <- ggplot(sales30) + 
         geom_point(aes(x=time, y=y),size=1, colour="Royalblue") + 
         ggtitle("Quarterly sales volumes") + 
         facet_wrap(~id, ncol=6) 
print(pl3)
```
<br/>
<font size="3" color="MediumSeaGreen">
*b. Fitting a model to the first product*
</font><br/>
<br/>
We are going to fit the model used previously for fitting the first series to this data.
```{r}
lm_periodic3 <- lm(data = subset(sales30, id==1), y ~ time + I(cos(2*pi*time/12)+sin(2*pi*time/12)))
summary(lm_periodic3)
```
```{r}
f <- function(x,c) coef(lm_periodic3)[1] + 
                   coef(lm_periodic3)[2]*x + 
                   coef(lm_periodic3)[3]*(cos(2*pi*x/12)+sin(2*pi*x/12))
pl3 <-ggplot(subset(sales30,id==1)) + geom_point(aes(x=time, y=y),size=2, colour="Royalblue") + ggtitle("Quarterly sales volumes first series")
pl3  + stat_function(fun=f, colour="lightcoral", size=1)
```
```{r}
par(mfrow = c(1, 1))
plot(lm_periodic3, which=c(1,1))
```
<br/>We can see how we get more or less the same results as the ones we obtained with the previous data. So we can agree that this models suits us for predicting the sales volume of one product on the data set.<br>
<br>
<font size="3" color="MediumSeaGreen">
*c. Fitting a mixed effect model*
</font><br/>
We have been able to create a model that describes pretty accurately the prediction of the sales volumes of one product. Now if we want to construct a model for discribing the overall behaviour of the sales volume for the different products its important to take into account the effect that each product may have over the prediciont. From the start is not clear to know over which component of the model the type of product will have and impact. Therefore we are going to define several different models assuming all the scenarios where factors could depend on the product. Then we are going to test and choose the model that best fits our data.<br/>
<br/>
<b>Model 1</b>: Nothing depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}$$
<br/>
<b>Model 2</b>: The value of the intercept (sales volume at time 0) depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+e_{ij}$$
<br/>
<b>Model 3</b>: The growth rate per quarter depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+e_{ij}$$
<br/>
<b>Model 4</b>: The intercept and the growth rate per quarter depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+\eta_{i1}*x_{ij}+e_{ij}$$
<br/>
<b>Model 5</b>: The periodic component depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}$$
<br/>
<b>Model 6</b>: The intercept and the periodic component depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}$$
<br/>
<b>Model 7</b>: The growth rate per quarter and periodic component depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}$$
<br/>
<b>Model 8</b>: Evereything depends on the product:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i0}+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}$$

```{r}
m1 <- lm(y ~   time + I(cos(2*pi*time/12) + sin(2*pi*time/12)), data=sales30)  
m2 <- lmer(y ~ time + I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (1|id), data=sales30) 
m3 <- lmer(y ~ time + I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1 + time|id) , data=sales30) 
m4 <- lmer(y ~ 1 + time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (time|id) , data=sales30) 
m5 <- lmer(y ~  time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30) 
m6 <- lmer(y ~ 1 +time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30) 
m7 <- lmer(y ~ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30) 
m8 <- lmer(y ~ 1+ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=sales30)

BIC(m1,m2,m3,m4, m5,m6,m7, m8)
```
</br>The BIC suggest then that the model that best fit our data and explains the effect each product has over the general model is model 7. This models suggest that each products introduce a random effect in the growth rate and also on the periodic component but the intercept remains the same for all the products.</br>
```{r}
m7
```
</br>Therefore the mixed effect model that describes our data is defined as following:</br>
$$y_{ij}=100.12+0.21*x_{ij}+1.06*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}$$
<br/>
<font size="3" color="MediumSeaGreen">
*d. Plotting the predicted values*
</font><br/>
<br/>
We are going to plot the data with the predicted sales given by our final model.
```{r}
prediction <- fitted(m7)
ggplot(data=sales30) + geom_point(aes(x=time,y=y), color="lightseagreen", size=1) + 
  geom_line(aes(x=time,y=prediction), color="Royalblue") + facet_wrap(~id, ncol=6) 
```
```{r}
qqnorm(residuals(m7))
```
</br>As we can see the plots above, our model fits pretty well every one of the 30 products since the quantile plot does not raise any significant concern with normality of the residuals.<br/>
<br/>
<font size="3" color="MediumSeaGreen">
*d. Adding a constraint over the intercept*
</font><br/>
<br/>
We are going to set the predicted sales volume of all products equal to 100 at time 0.<br/>
```{r}
constraint_model <- lmer(I(y-100) ~ 0 +time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id), data = sales30)
f<-fitted(constraint_model)
ggplot(data=sales30) + geom_point(aes(x=time,y=y), color="lightseagreen", size=1) + 
  geom_line(aes(x=time,y=f+100), color="Royalblue") + facet_wrap(~id, ncol=6) 
```
```{r}
qqnorm(residuals(constraint_model))
```
</br>Once again we see how adding the constraint, still gets us a well fitted model and as we can see in the quantile plot it does not  have a negative impact on the effectiveness of our model.</br>
</br>
</br>
<font size="5" color="seagreen"><b>
*2. Individual prediction*
</b></font></br>
</br>
Now we are going to see how well does our model perform for predicting the sales volume for a new product.</br>
```{r}
data_path <- "/Users/mariafcadena/Documents/Master Degree/Polytechnique/Statistics in Action/Case Study 2/salesData/salesNew.csv"
salesNew    <- read.csv(file=data_path)
```
<br/>
<font size="3" color="MediumSeaGreen">
*a. Prediction without any data*
</font><br/>
<br/>
Without any information on how are the sales for this new product we will assume that we can predict the sales with the fixed parts of the model we previously defined.
<br/>
Given that our model is:
$$y_{ij}=\beta_{0}+\beta_{1}*x_{ij}+\beta_{2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+\eta_{i1}*x_{ij}+\eta_{i2}*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]+e_{ij}$$
We will assume that $\eta_{i0}$, $\eta_{i1}$, and $\eta_{i2}$ are zero since we dont have anydata to define the random effect this product will have over the population values.<br/>
<br/>
Therefore our equation for prediction the values of this product will be:
$$y_{ij}=100.12+0.21*x_{ij}+1.06*\bigg[cos\bigg(\frac{2\pi x_{ij}}{12}\bigg)+sin\bigg(\frac{2\pi x_{ij}}{12}\bigg)\bigg]$$
(Intercept)                                              time  
                                                                                      
I(cos(2 * pi * time/12) + sin(2 * pi * time/12))  
                                           
```{r}
f <- function(x,c) 100.1225 + 
                   0.2155*x + 
                   1.0666*(cos(2*pi*x/12)+sin(2*pi*x/12))
pl4       <- ggplot(salesNew) + 
             geom_point(aes(x=time, y=y),size=3, colour="Royalblue") +
             ylab("% Sales Volumes") +
             xlab("time")
print(pl4 + ggtitle("Quarterly sales volumes new product") + stat_function(fun=f, colour="lightcoral", size=1))
```
<br/>As we can see, the fixed part of our model gives us an approximation of the sales volume for the product. But its clear that there is a need on having some data from the previous sales behaviour of product, to fit or train our model to predict more accurately the sales for this new product. Or in other words to understand the random effect that this product has on the "population" growth and periodical component.<br/>
<br/>
<font size="3" color="MediumSeaGreen">
*b. Prediction with one data point for the new product*
</font><br/>
<br/>
To adjust the model we are going to fit again the mixed effect model with the data from the new product. Since we only have data for one period (time=1), then we are going to assume the population values for the other quarters and train the model with this data.<br/>
```{r}
New_Data<-sales30
New_Product<-as.data.frame(cbind(matrix(31, ncol = 1, nrow = nrow(salesNew)), salesNew$time))
f <- function(x,c) 100.1225 + 
                   0.2155*x + 
                   1.0666*(cos(2*pi*x/12)+sin(2*pi*x/12))
New_Product<-cbind(New_Product, lapply(New_Product[2], f))
colnames(New_Product) <- c("id", "time","y")
#We create a new dataframe with the new product, the values that we used for the sales volume (y) are the ones predicted by the fixed 
New_Data<-rbind(New_Data, New_Product)
#We update the value of time 1 for the new product with the known value
time1<-salesNew$y[salesNew$time == 1 ] 
New_Data$y[New_Data$time==1 & New_Data$id==31]<-time1
```

```{r}
#Now we fit again the mixed effect model with this new data
New_model_d1<- lmer(y ~ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=New_Data)
```
<br/>To validate how the new model fits the new product, we are going to obtain the betas or coefficients from the mixed effect model for the new product (id=31) and then contruct the function to predict the sales volume. With this information we are going to plot our prediction vs the actual values to see how well the model performs.<br/>
```{r}
betas<-cbind(coef(New_model_d1)$id, matrix(c(1:31),ncol=1, nrow=31))
colnames(betas) <- c("b_0", "b_1","b_2", "id")

f_d1 <- function(x,c) betas$b_0[betas$id==31] + 
                   betas$b_1[betas$id==31]*x + 
                   betas$b_2[betas$id==31]*(cos(2*pi*x/12)+sin(2*pi*x/12))

print(pl4 + ggtitle("Quarterly sales volumes new product") + stat_function(fun=f_d1, colour="lightcoral", size=1))
```
<br/>And compute the residual sum of squares to understand how good the model performs.<br/>
```{r}
Final_Data      <- as.data.frame((salesNew[2]-lapply(salesNew[1], f))^2)
PredictionError <- sum(Final_Data$y)
PredictionError
```
<br/>We can see that still the model does not gather enough information to correctly defined the random effect that the new product has over the fixed model <br/>
<br/>
<font size="3" color="MediumSeaGreen">
*c. Prediction with increasing data for the new product*
</font><br/>
<br/>
Now we are going to perform the same computation that we did before, but now adding at a time one known value of the sales volume. To see how adding information to the model helps as improve the fitting of the model.<br/>
```{r}
New_Data<-sales30
New_Product<-as.data.frame(cbind(matrix(31, ncol = 1, nrow = nrow(salesNew)), salesNew$time))
f <- function(x,c) 100.1225 + 
                   0.2155*x + 
                   1.0666*(cos(2*pi*x/12)+sin(2*pi*x/12))
New_Product<-cbind(New_Product, lapply(New_Product[2], f))
colnames(New_Product) <- c("id", "time","y")
#We create a new dataframe with the new product, the values that we used for the sales volume (y) are the ones predicted by the fixed 
New_Data<-rbind(New_Data, New_Product)
iter=0
for (i in salesNew$time)
{
  iter=iter+1
  #We update the value of y with then known sales valume for the i quarter
  sales_iquarter<-salesNew$y[salesNew$time == i ] 
  New_Data$y[New_Data$time==i & New_Data$id==31]<-sales_iquarter
  
  #We fit the model to the new data
  New_model_dt<- lmer(y ~ time+ I(cos(2*pi*time/12) + sin(2*pi*time/12)) + (-1+time+I(cos(2*pi*time/12) + sin(2*pi*time/12))|id) , data=New_Data)

  #we obtain the coefficients and re-construct the model for the new product (id=31)
  betas<-cbind(coef(New_model_dt)$id, matrix(c(1:31),ncol=1, nrow=31))
  colnames(betas) <- c("b_0", "b_1","b_2", "id")

  f <- function(x,c) betas$b_0[betas$id==31] + 
                     betas$b_1[betas$id==31]*x + 
                     betas$b_2[betas$id==31]*(cos(2*pi*x/12)+sin(2*pi*x/12))
  
  #We compute the new prediction for the sales volume and store in a dataframe the results of our iteration
  if (i==1){
    Results<-as.data.frame(cbind(matrix(toString(iter), ncol = 1, nrow = nrow(salesNew)),salesNew$time,salesNew$y, as.data.frame(lapply(salesNew[1], f))))
    colnames(Results)<-c("iteration","time","y","prediction")
  } else{
    temp<-as.data.frame(cbind(matrix(toString(iter), ncol = 1, nrow = nrow(salesNew)),salesNew$time,salesNew$y, as.data.frame(lapply(salesNew[1], f))))
    colnames(temp)<-c("iteration","time","y","prediction")
    Results<-rbind(Results,temp)
  }
}
```
<br/>Having the results for each iteration of adding known values to fit the model, we can now plot the predicted values per iteration against the actual values.<br/>
```{r}
pl5       <- ggplot(Results) + 
             geom_point(aes(x=time, y=y),size=3, colour="Royalblue") +
             geom_line(aes(x=time, y=prediction,col=iteration)) +
             ylab("% Sales Volumes") +
             xlab("time")
pl5 + ggtitle("Predicted Values per Iteration vs Actual Values")

```
<br/>We also comput the sum of residual errors to be able to compare the results of each model<br/>
```{r}
Results["SR"]<-as.data.frame((Results$y-Results$prediction)^2)
SSR<-aggregate(Results$SR, by=list(iteration=Results$iteration), FUN=sum)
ggplot(SSR) + geom_point(aes(x=iteration, y=x),size=2, colour="red") + ggtitle("Convergence of SSR per Iteration")
```
<br/>We see how having more and more information of our new product help the model to have a better fit of the data and therefore have lower residual error<br/>
<br/>
We can plot the final fit of the mixed effect model with all the data available:
```{r}
f<-fitted(New_model_dt)
ggplot(data=New_Data) + geom_point(aes(x=time,y=y), color="lightseagreen", size=1) + 
  geom_line(aes(x=time,y=f), color="Royalblue") + facet_wrap(~id, ncol=6) 
```
We can see how the model manages to gather the effects of the new product and at the same time does not damages the predictions for the other products.
